237 research outputs found

    Genome signatures, self-organizing maps and higher order phylogenies: a parametric analysis

    Genome signatures are data vectors derived from the compositional statistics of DNA. The self-organizing map (SOM) is a neural network method for the conceptualisation of relationships within complex data, such as genome signatures. The various parameters of the SOM training phase are investigated for their effect on the accuracy of the resulting output map. It is concluded that larger SOMs, as well as taking longer to train, are less sensitive in the phylogenetic classification of unknown DNA sequences; however, where a classification can be made, a larger SOM is more accurate. Increasing the number of iterations in the training phase of the SOM only slightly increases accuracy, without improving sensitivity. The optimal length of the DNA sequence k-mer from which the genome signature should be derived is 4 or 5, but shorter values are almost as effective. Overall, these results indicate that small, rapidly trained SOMs are as good as larger, longer-trained ones for the analysis of genome signatures. These results may also be more generally applicable to the use of SOMs for other complex data sets, such as microarray data.
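
    As a concrete illustration of the kind of input such a map is trained on, here is a minimal Python sketch of deriving a genome signature as a k-mer frequency vector (the plain normalisation and the absence of strand-symmetry correction are simplifying assumptions, not details from the paper):

        from itertools import product

        def genome_signature(seq, k=4):
            """Compositional genome signature: a normalised k-mer frequency
            vector. k=4 or 5 follows the optimum reported above; in practice
            this vector would be one training input to the SOM."""
            seq = seq.upper()
            kmers = ["".join(p) for p in product("ACGT", repeat=k)]
            counts = dict.fromkeys(kmers, 0)
            for i in range(len(seq) - k + 1):
                window = seq[i:i + k]
                if window in counts:      # skip windows containing N, gaps, etc.
                    counts[window] += 1
            total = sum(counts.values()) or 1
            return [counts[km] / total for km in kmers]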

    Less is more: the battle of Moore's law against Bremermann's limit on the field of systems biology

    Background: I run my bioinformatics tasks on two machines. The first one is a Tru64 DS20 AlphaServer, bought in 1999. This has two processors running at 512 MHz with 2 GB of memory. The second is a custom-built Linux box, purchased in early 2005, which has 8 processors running at 2.7 GHz and 12 GB of memory. Although performance does not quite scale linearly with processor speed, this represents just over a 5-fold increase in computing power over a period of 6 years. This kind of thing has been happening since the 1960s, when Intel co-founder Gordon Moore observed that available processing power doubles every 2 years or so. Indeed, my "fast" Linux box is already quite pedestrian compared to the 3.8 GHz processors that are now routinely available. Under these circumstances, it is easy to become complacent about handling awkward jobs. Large, viral-genome-scale ClustalW jobs that used to run for several days on my DS20 are now finished overnight. Within the horizon of a typical 3-year scientific project, I could conceivably buy a machine that shortens the time to a couple of hours or so. But just because bioinformatics tasks are now becoming increasingly trivial in terms of computer time, does that mean we can expect similar gains in systems biology problems? There is a whole industry of popular science books which attempt to persuade us that "Moore's Law" will be the basis of a future world in which anything is computable almost instantly, and we will, even within our lifetimes, know "everything". How seriously should we take such claims and what relevance do they have to research in the here-and-now?
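
    The back-of-envelope arithmetic behind these projections is just repeated doubling; a quick sketch, assuming the stated two-year doubling period:

        # Capacity doubles every 2 years, so over a 3-year project horizon a
        # job's runtime shrinks by roughly 2**(3/2) ~ 2.8x; over the author's
        # 6-year machine gap the idealised gain is 8x (about 5x was seen in
        # practice, since performance does not scale linearly with clock speed).
        doubling_period_years = 2.0
        for years in (3, 6):
            factor = 2 ** (years / doubling_period_years)
            print(f"{years} years -> ~{factor:.1f}x more computing power")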

    Letter: TreeAdder: a tool to assist the optimal positioning of a new leaf into an existing phylogenetic tree

    TreeAdder is a computer application that adds a leaf in all possible positions on a phylogenetic tree. The resulting set of trees represents a dataset appropriate for maximum-likelihood calculation of the optimal tree. TreeAdder therefore provides a utility for what was previously a tedious and error-prone process.
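
    The enumeration itself can be sketched in a few lines of Python (TreeAdder operates on Newick files; the nested-tuple representation and function name here are illustrative assumptions, not the tool's actual interface):

        def add_leaf_everywhere(tree, leaf):
            """Yield every tree formed by attaching `leaf` along one branch of
            a rooted binary `tree` (nested tuples; a string is a leaf)."""
            # Attach on the branch above the current subtree.
            yield (tree, leaf)
            if isinstance(tree, tuple):
                left, right = tree
                for t in add_leaf_everywhere(left, leaf):   # placements inside left
                    yield (t, right)
                for t in add_leaf_everywhere(right, leaf):  # placements inside right
                    yield (left, t)

        # Five candidate trees, one per branch (plus above the root):
        for t in add_leaf_everywhere(("A", ("B", "C")), "D"):
            print(t)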

    Gene Expression Studies in First Trimester Embryogenesis

    Full text available to Imperial users only.

    Evolution of the G+C content frontier in the rat cytomegalovirus genome

    Within the 230,138 bp of the rat cytomegalovirus (RCMV) genome, the G+C content changes abruptly at position 142,644, constituting a G+C content frontier. To the left of this point, overall G+C content is 69.2%; to the right it is only 47.6%. A region of extremely low G+C content (33.8%) is found in the 5 kb immediately to the right of the frontier, in which there are no predicted coding sequences. To the right of position 147,501, the G+C content rises and predicted coding sequences reappear. However, these genes are much shorter (average 848 bp, 50% G+C) than those in the left two-thirds of the genome (average 1462 bp, 70% G+C). Whole-genome alignment of several viruses indicates that the initial ultra-low G+C region appeared in the common ancestor of the genera Cytomegalovirus and Muromegalovirus, and that the lowering of G+C in the right third has been a subsequent process in the lineage leading to RCMV. The left two-thirds of RCMV has stop codon occurrences at 67.5% of their expected level, based on a modified Markov chain model of stop codon distribution; the corresponding figure for the right third is 78%. Therefore, despite heavy mutation pressure, selective constraint has operated in the right third of the RCMV genome to maintain a degree of gene length unusual for such low G+C sequences.
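
    For intuition about how strongly base composition suppresses stop codons, a zero-order null model can be sketched as below (this assumes A=T and G=C frequencies fixed by G+C content alone; it is a simpler null than the modified Markov chain model used in the study):

        from itertools import product

        STOPS = {"TAA", "TAG", "TGA"}

        def expected_stop_fraction(gc):
            """Expected fraction of codons that are stops under a zero-order
            base-composition model (an illustrative null, not the paper's)."""
            p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
            return sum(p[a] * p[b] * p[c]
                       for a, b, c in product("ACGT", repeat=3)
                       if a + b + c in STOPS)

        print(expected_stop_fraction(0.70))  # ~0.019: stops rare at 70% G+C
        print(expected_stop_fraction(0.50))  # ~0.047: stops common at 50% G+C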

    Comparison of Eurovision Song Contest Simulation with Actual Results Reveals Shifting Patterns of Collusive Voting Alliances.

    The voting patterns in the Eurovision Song Contest have attracted attention from various researchers, spawning a small cross-disciplinary field of what might be called 'eurovisiopsephology', incorporating insights from politics, sociology and computer science. Although the outcome of the contest is decided using a simple electoral system, its single parameter - the number of countries casting a vote - varies from year to year. Analytical identification of statistically significant trends in voting patterns over a period of several years is therefore mathematically complex. Simulation provides a method for reconstructing the contest's history using Monte Carlo methods. Comparison of simulated histories with the actual history of the contest allows the identification of statistically significant changes in patterns of voting behaviour, without requiring a full mathematical solution. In particular, the period since the mid-90s has seen the emergence of large geographical voting blocs from previously small voting partnerships, which initially appeared in the early 90s. On at least two occasions, the outcome of the contest has been crucially affected by voting blocs. The structure of these blocs implies that a handful of centrally placed countries have a higher probability of being future winners.
    Keywords: Simulation, Perl, Eurovision Song Contest, Voting Blocs, Collusive Voting
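
    The Monte Carlo approach can be sketched as follows (in Python rather than the Perl used in the study; the uniform-random null and function name are illustrative assumptions, and the published simulations reconstruct each year's actual participant list rather than a fixed one):

        import random

        def simulate_null_history(countries, n_years, seed=0):
            """One simulated contest history under a no-collusion null: each
            year every country awards the 12, 10, 8..1 point scale to ten
            other countries chosen uniformly at random (so at least 11
            countries are required). Returns total points given from each
            country to each other country."""
            rng = random.Random(seed)
            scale = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]
            given = {(a, b): 0 for a in countries for b in countries if a != b}
            for _ in range(n_years):
                for voter in countries:
                    others = [c for c in countries if c != voter]
                    for pts, recipient in zip(scale, rng.sample(others, 10)):
                        given[(voter, recipient)] += pts
            return given

        # Repeating this many times gives a null distribution for each directed
        # pair; observed totals far in the upper tail flag candidate alliances.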

    Voting in Eurovision: shared tastes or cultural epidemic?

    Apparent vote-exchange ("logrolling") in the Eurovision Song Contest has been variously interpreted as a manifestation of political attitudes within Europe, a reflection of regional tastes in pop music, or a social (memetic) epidemic. This paper provides data supporting the third of these options, and demonstrates that the cultural contagion has now nearly reached saturation. As well as logrolling, ethnic diasporas and the "semi-final effect" are also shown to influence the result of the contest. Reform of the voting system to produce a contest that better rewards musical excellence, without suppressing the mass participation element, is therefore a complex problem.

    Modelling the structure of full-length Epstein-Barr virus nuclear antigen 1

    Epstein-Barr virus (EBV) is a clinically important human virus associated with several cancers and is the etiologic agent of infectious mononucleosis. The viral nuclear antigen-1 (EBNA1) is central to the replication and propagation of the viral genome and likely contributes to tumourigenesis. We have compared EBNA1 homologues from other primate lymphocryptoviruses (LCV) and found that the central glycine/alanine repeat (GAr) domain, as well as the predicted cellular protein (USP7 and CK2) binding sites, are present in homologues from the Old World primates but not the marmoset, suggesting that these motifs may have co-evolved. Using the resolved structure of the C-terminal one-third of EBNA1 (the homodimerisation and DNA-binding domain), we have gone on to develop in silico monomeric and dimeric models of the full-length protein. The C-terminal domain is predicted to be structurally highly similar between homologues, indicating conserved function. Zinc could be stably incorporated into the model, bonding with two N-terminal cysteines predicted to facilitate multimerisation. The GAr contains secondary structural elements in the models, while the protein-binding regions are unstructured, irrespective of the prediction approach used and sequence origin. These intrinsically disordered regions may facilitate the diversity observed in partner interactions. We hypothesise that the structured GAr could mask the disordered regions, thereby protecting the protein from default degradation. In the dimer conformation, the C-terminal tails of each monomer wrap around a proline-rich protruding loop of the partner monomer, providing dimer stability, a feature that could be exploited in therapeutic design.
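
    As a small illustration of one sequence-level step in such a comparison, candidate GAr regions can be located with a simple residue-run scan (Python sketch; the 20-residue threshold and the toy sequence are illustrative assumptions, not values from the study):

        import re

        def find_ga_repeats(seq, min_len=20):
            """Return (start, end) spans of runs composed only of glycine (G)
            and alanine (A) residues at least `min_len` long."""
            return [(m.start(), m.end())
                    for m in re.finditer(rf"[GA]{{{min_len},}}", seq)]

        print(find_ga_repeats("MSDE" + "GA" * 15 + "PKRPR"))  # [(4, 34)]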